library(dplyr)
library(readr)
library(kableExtra)
library(ggplot2)
library(maps)
library(tidyr)
library(tidyverse)
library(cluster)
library(factoextra)
library(dendextend)
library(ROCR)
library(glmnet)
library(caret)
library(tree)
library(maptree)
library(ggridges)
library(viridis)
library(hrbrthemes)
library(IRdisplay)
Predicting voter behavior is a difficult task that depends on many variables which can change quickly (a kind of "chaos"), skewing your results. Polls are essential to voter behavior prediction, but only if they are conducted correctly, with random sampling and large sample sizes. Even then, polls are at risk of bias from low response rates and inaccurate polling, and they do not measure voter turnout. Another reason voter behavior prediction is hard is that by the time you complete your analysis, the variables and their relationships to each other may already have changed. This can lead to a completely different result than expected, misleading the audience. This is especially crucial to campaign leaders, whose time is extremely valuable and who need to understand which areas to focus on. Lastly, the infrequency of elections makes it difficult for statisticians to practice and refine models to accurately predict elections. With presidential elections occurring every four years, and only about one year of strong campaigning, it is hard to collect data that will accurately portray how election night will go.
However, despite all these complications, some methods produced predictions that fell very close to the actual results. In particular, the methods used by Nate Silver did: Silver was able to model voting behavior and correctly predict the election result for all 50 states. He utilized hierarchical modeling in order to analyze state- and national-level voting together, modeling intended voting behavior in each state as an observed variable which he then used to predict the voting results. What made his model stand out was his treatment of poll percentages (poll percentage = actual percentage + house effect + sampling variation): rather than simply taking the actual percentage with the maximum probability under the assumption that "more variation is less likely," Silver calculated the probability of each level of support on each day and then used the model for the next day to ask how likely it was that support changed from one percentage to another. He achieved this by using Bayes' Theorem and graph theory to determine new probabilities for each level of support. With these probabilities, he incorporated hierarchical modeling to produce predictions for each level of estimated support at the state and national level on any day leading up to the election.
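Silver's day-to-day updating can be sketched with a toy Bayes computation. This is illustrative only, with hypothetical numbers; it is not his actual model or data:

```r
# Hypothetical sketch of a Bayes-style daily update over discrete support levels.
support <- seq(40, 60, by = 1)                 # candidate support levels (%)
prior   <- dnorm(support, mean = 50, sd = 3)   # yesterday's belief about support
prior   <- prior / sum(prior)

# Likelihood of today's poll (say, 52%) given each support level,
# allowing for sampling variation of about +/- 2 points.
likelihood <- dnorm(52, mean = support, sd = 2)

# Bayes' Theorem: posterior is proportional to prior * likelihood.
posterior <- prior * likelihood
posterior <- posterior / sum(posterior)

support[which.max(posterior)]  # most probable support level after the update
```

The posterior shifts partway from the prior mean toward the new poll, which is the behavior the hierarchical model exploits day by day.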
So what went wrong with the prediction results in 2016 that placed Clinton as a clear favorite? The inaccurate predictions can be attributed to poor polling, misinterpretation and mediocre analysis, and weak election models. In 2016, inadequate state polls were relied on to develop models and prediction forecasts, and as a result we saw the national polls miss in the same direction across the board. Weak election models are also believed to be to blame, as they could have caused forecasting errors: there are many variables, and small changes can produce large swings. To ensure better predictions in the future, I believe polling has to be a top priority: collecting data correctly with random sampling and large sample sizes, paying extra attention to swing states to make sure those models are correct, and developing methods to predict voter turnout (as well as encouraging adequate turnout on election day). Analysis of the weights of certain factors, such as whether the candidate is an incumbent, would also help statisticians develop models that accurately portray changes in voter behavior.
election.raw <- read_delim("election.csv", delim = ",") %>% mutate(candidate=as.factor(candidate))
census_meta <- read_delim("metadata.csv", delim = ";", col_names = FALSE)
census <- read_delim("census.csv", delim = ",")
The data that we are analyzing is such that each observation consists of five main characteristics:
1. county: the name of the county where the voting took place.
2. fips: the Federal Information Processing Standard (FIPS 6-4) code, a five-digit code that uniquely identifies counties. The first two digits are the FIPS state code and the last three digits are the county code within the state.
3. candidate: the name of the candidate voted for in that county.
4. state: the abbreviation of the state where the county is located and where the voting took place.
5. votes: the number of votes that a candidate received in that county.
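The state/county structure of a FIPS code can be illustrated in base R; the Los Angeles County code 06037 is used as an example:

```r
# A FIPS county code is two state digits followed by three county digits.
# 06037 = state 06 (California) + county 037 (Los Angeles County).
fips <- "06037"
state_part  <- substr(fips, 1, 2)
county_part <- substr(fips, 3, 5)
state_part   # "06"
county_part  # "037"
```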
Each observation is expressed as a row in a table and each characteristic is a column. We can glimpse the data by taking Los Angeles County as an example.
election.raw %>% filter(county == "Los Angeles County")%>% kable('html')%>% as.character()%>%
display_html()
We can observe the number of votes that Los Angeles County cast for each presidential candidate, with the candidates identified by name, the votes expressed as integers, the state expressed as an abbreviation, and the county identified by its fips code.
Further, some rows in our data are summary rows; these rows contain NA values in the county column. There are two types of summary rows:
1. Federal-level summary rows have a fips value of US.
2. State-level summary rows have the state abbreviation as the fips value.
head(election.raw %>% filter(is.na(county)==TRUE),10)%>% kable('html')%>% as.character()%>%
display_html()
tail(election.raw %>% filter(is.na(county)==TRUE),10)%>% kable('html')%>% as.character()%>%
display_html()
Above we can see a glimpse of the summary observations: a state-level summary has an NA value for county and the state abbreviation in the fips column, while a federal-level summary has an NA value for county and US in the fips column. That is, a summary observation has an NA value for county and a non-numerical value for its fips code.
Before proceeding with our analysis of the data it is important to deal with any occurrence of faulty or flawed data. Further inspection of the data reveals a problem: for rows where the fips column is 2000, the respective county is not available and is instead recorded as NA.
election.raw %>% filter(fips==2000)%>% kable('html')%>% as.character()%>%
display_html()
At first it would seem reasonable to look up the county corresponding to the fips code 2000, but doing so reveals that this code does not exist. Next, it would seem reasonable that, with enough information on the state of AK, we could deduce which county these values pertain to, so we inspect the available data for AK:
election.raw %>% filter(state=='AK')%>% kable('html')%>% as.character()%>%
display_html()
We observe that the only available data we have for Alaska is the state summary and the rows where the county code is 2000. With this information we cannot reasonably deduce which county the data belongs to. Therefore the best practice is to remove the rows where the county code is 2000.
dim(election.raw)
The dimensions of our data before removing are 18351 rows and 5 columns.
election.raw<-election.raw[!(election.raw$fips==2000),]
dim(election.raw)
After removing the observations where the county code is 2000 the dimensions for the data are 18345 rows and 5 columns.
For the purposes of our analysis it is practical to remove federal and state level summaries of election voting. These will interfere with the analysis that we want to conduct. Our current dimensions inclusive of state and federal summaries are 18345 rows and 5 columns.
election_federal <- election.raw %>% filter(fips == 'US' & is.na(county))
election_state <- election.raw %>% filter(fips != "US" & is.na(county))
election <- election.raw %>% filter(fips != "US" & as.character(fips) != as.character(state))
election.raw<-election.raw[!(is.na(election.raw$county)==TRUE),]
dim(election.raw)
After removing all summaries (federal and state), the dimensions of our data are now 18011 rows and 5 columns.
In the 2016 election there were 31 candidates for president. We can run an initial visualization of the votes received by each candidate to get a sense of the outcome of the election and the distribution of votes.
options(repr.plot.width = 30, repr.plot.height = 30, repr.plot.res = 100)
ggplot(election_federal, aes(x = candidate, y = votes/1000000)) +
geom_bar(stat = "identity") + coord_flip() + scale_y_continuous() + ylab("Votes (in millions)")+
theme(text = element_text(size = 30), element_line(size = .1))
For simplicity of analysis we create county_winner and state_winner by taking the candidate with the highest proportion of votes. To create county_winner we start with election, group by fips, compute total votes, and create the variable pct = votes/total; we then choose the highest row using top_n(). We follow a similar approach for state_winner, arriving at an aggregated column containing the winner for every state.
county_group <- group_by(election, fips)
county_total <- summarize(county_group, total = sum(votes))
county.group <- left_join(county_group, county_total, by = "fips")
county_pct <- mutate(county.group, pct = votes/total)
county_winner <- top_n(county_pct, n =1)
state_group <- group_by(election_state, state)
state_total <- summarize(state_group, total = sum(votes))
state.group <- left_join(state_group, state_total, by = "state")
state_pct <- mutate(state.group, pct = votes/total)
state_winner <- top_n(state_pct, n= 1)
Visualization is crucial for gaining insight and intuition during data mining. We will map our data onto a map of the United States in order to have visual representations of the results at the county and state levels.
The R package ggplot2 can be used to draw maps and we make use of it to create a map visualization of our data.
The variable counties contains the information needed to draw the county polygons, with fill colors determined by subregion. We create an initial visualization of the U.S. counties using counties = map_data("county").
counties=map_data("county")
options(repr.plot.width = 30, repr.plot.height = 30, repr.plot.res = 100)
ggplot(data = counties) +
geom_polygon(aes(x = long, y = lat, fill = subregion, group = group), color = "white") +
coord_fixed(1.3) +
guides(fill=FALSE)+ # color legend is unnecessary and takes too long
theme(text = element_text(size = 30), element_line(size = .1))
Now we color the map by the winning candidate for each state. We'll combine the two datasets based on state name. However, the state names are in different formats in the two tables, e.g. AZ vs. arizona, so before using left_join() we create a common column in states named fips = state.abb[match(states$region, casefold(state.name))]. We then combine the states variable with the state_winner table created earlier using left_join(). A call to left_join() keeps all the rows from the first table and looks for matches in the second table. If it finds a match, it adds the data from the second table; if not, it adds missing values. After doing this we visualize our results:
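The missing-value behavior of left_join() can be seen on a toy example (hypothetical keys; "ZZ" stands in for a row with no match):

```r
library(dplyr)

states_toy  <- data.frame(fips = c("AZ", "CA", "ZZ"))
winners_toy <- data.frame(fips      = c("AZ", "CA"),
                          candidate = c("Donald Trump", "Hillary Clinton"))

# Every row of the first table is kept; "ZZ" has no match, so candidate is NA.
joined <- left_join(states_toy, winners_toy, by = "fips")
joined
```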
states <- map_data("state")
states <- states %>%
mutate(fips=state.abb[match(states$region, casefold(state.name) )])
states <- left_join(states, state_winner, by="fips")
options(repr.plot.width = 30, repr.plot.height = 30, repr.plot.res = 100)
ggplot(data = states) +
geom_polygon(aes(x = long, y = lat, fill = candidate, group = group),colour="white") +
coord_fixed(1.3)+
guides(fill=guide_legend(title="Candidate"))+theme(text = element_text(size = 30), element_line(size = .1))
Red signifies that Donald Trump won the state and blue signifies that Hillary Clinton won.
The counties variable does not have a fips column, so we create one by pooling information from maps::county.fips. We split the polyname column into region and subregion, use left_join() to combine county.fips into counties, and then left_join() the previously created county_winner. Finally we visualize the election results at the county level, colored by the winner of each county.
oldw <- getOption("warn")
options(warn = -1)
county_split=separate(maps::county.fips,polyname,c("region", "subregion"),sep="," )
county_combined=left_join(county_split,counties,by=c("region", "subregion"))
county_combined$fips=as.factor(county_combined$fips)
county_new=left_join(county_combined,county_winner)
options(repr.plot.width = 30, repr.plot.height = 30, repr.plot.res = 100)
ggplot(data = county_new) +
geom_polygon(aes(x = long, y = lat, fill = candidate, group = group), colour = "white") +
coord_fixed(1.3)+
guides(fill=guide_legend(title="Candidate"))+theme(text = element_text(size = 30), element_line(size = .1))
options(warn = oldw)
Red signifies that Donald Trump won the county and blue signifies that Hillary Clinton won.
It would be interesting to visualize the poverty percentage by state, given that the United States holds very varied views on socioeconomic systems that are often thought to be linked to the economic status of groups. We can then visualize the percentage of poverty for each state.
options(repr.plot.width = 30, repr.plot.height = 30, repr.plot.res = 100)
ggplot(data = census, aes(y = as.factor(State), x = Poverty, fill = ..x..)) +
geom_density_ridges_gradient(scale = 4, rel_min_height = 0.01) +
scale_fill_viridis_c(name = "Poverty Percentage", option = "C") +
ggtitle('Poverty Percentage by State') +
ylab("State") + xlab("Poverty Percentage")+
theme(text = element_text(size = 30), element_line(size = .1))
This ridgeplot illustrates the poverty percentage of each state; most states hover around the range of 8%-12%.
It would also be interesting to visualize the percentage of the Black population in each state, given that social issues surrounding the Black community have gained greater political traction, and that these issues are commonly associated with social ideas on which political ideologies differ greatly.
options(repr.plot.width = 30, repr.plot.height = 30, repr.plot.res = 100)
ggplot(data = census, aes(y = as.factor(State), x = Black, fill = ..x..)) +
geom_density_ridges_gradient(scale = 4, rel_min_height = 0.01) +
scale_fill_viridis_c(name = "Black Population Percentage", option = "C") +
ggtitle('Black Population Percentage by State') +
ylab("State") + xlab("Black Population Percentage")+theme(text = element_text(size = 30), element_line(size = .1))
This plot shows the Black population percentage per state; the vast majority of states possess a very low percentage, under 5%.
It would also be interesting to see a visual representation of the White population, given that it is the predominant population in the United States and that a great deal of the election result is attributed to it, despite the amount of variance present within this demographic.
census1=census %>% group_by(State) %>% summarise(White_Pct=mean(White,na.rm=TRUE))
options(repr.plot.width = 30, repr.plot.height = 30, repr.plot.res = 100)
ggplot(census1,aes(x=State,y=White_Pct)) +
geom_bar(stat="identity")+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ggtitle("Percentage of White population in each State") +
ylab("White Population Percentage")+theme(text = element_text(size = 50), element_line(size = .1))
We can observe that in 46 states, the White population comprises the majority population with regard to race/ethnicity.
The census data contains high-resolution information (more fine-grained than county-level). We will aggregate it into county-level data by computing the TotalPop-weighted average of each attribute for each county. We accomplish this by first cleaning the census data into census.del: starting from census, we filter out any rows with missing values, convert the {Men, Employed, Citizen} attributes to percentages (the metadata seems to be inaccurate), compute a Minority attribute by combining {Hispanic, Black, Native, Asian, Pacific}, and then remove {Walk, PublicWork, Construction}.
Next we create census.subct: starting with census.del, we group_by() the two attributes {State, County}, use add_tally() to compute CountyTotal, and compute the weight as TotalPop/CountyTotal.
Finally we create census.ct: starting with census.subct, we use summarize_at() to compute the weighted sum (a TotalPop-weighted mean of each attribute).
census.del <- na.omit(census) %>% mutate(Men = Men/TotalPop*100,
Employed = Employed/TotalPop*100,
Citizen = Citizen/TotalPop*100,
Minority = Hispanic+Black+Native+Asian+Pacific) %>%
select(-Women, -Hispanic, -Native, -Black, -Asian, -Pacific, -Construction,-Walk, -PublicWork) #census.del <- census.del[,c(1:6,29,7:28)] # reordering (want minority next to white)
census.subct<-census.del%>%
group_by(State,County) %>%
add_tally(TotalPop) %>%
mutate(CountyTotal=n)%>%
mutate(Weight=TotalPop/CountyTotal) %>%
select(-n)
census.ct <- census.subct %>% summarize_at(vars( Men, White, Citizen, Income, IncomeErr, IncomePerCap, IncomePerCapErr, Poverty, ChildPoverty, Professional, Service, Office, Production, Drive, Carpool, Transit, OtherTransp, WorkAtHome, MeanCommute, Employed, PrivateWork, SelfEmployed, FamilyWork, Unemployment, Minority, CountyTotal), funs(weighted.mean(., Weight)))
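The weighted aggregation above can be sanity-checked on a toy county (hypothetical numbers standing in for two sub-county tracts):

```r
# Two hypothetical sub-county tracts in one county.
pop     <- c(1000, 3000)        # TotalPop of each tract
poverty <- c(10, 20)            # Poverty percentage of each tract

weight <- pop / sum(pop)        # TotalPop / CountyTotal, as in census.subct
sum(poverty * weight)           # county-level weighted mean: 17.5
weighted.mean(poverty, weight)  # same result via weighted.mean()
```

The larger tract pulls the county figure toward its own value, which is exactly the behavior we want from population weighting.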
Observing the first few rows of census.ct, this is what our data looks like.
head(census.ct[,1:7])%>% kable('html')%>% as.character()%>%
display_html()
head(census.ct[,8:14])%>% kable('html')%>% as.character()%>%
display_html()
head(census.ct[,15:21])%>% kable('html')%>% as.character()%>%
display_html()
head(census.ct[,22:27])%>% kable('html')%>% as.character()%>%
display_html()
We would like to run principal component analysis in order to reduce the dimensionality of our data to a basis of attributes that are most likely to characterize and distinguish observations through the variance of the attributes. Before running principal component analysis, we can look at the variance of our data at the county level and at the sub-county level.
census_var<-data.frame(apply(census.ct[sapply(census.ct,is.numeric)] , 2, var))
subcensus_var<-data.frame(apply(census.subct[sapply(census.subct,is.numeric)],2,var))
names(census_var)[1]<-"County Variance"
names(subcensus_var)[1]<-"Sub-County Variance"
census_var
subcensus_var
We can then move on to observing the means on the subcounty and on the county level.
meanofcensus.ct<-data.frame(apply(as.matrix(census.ct[, 3:27]), 2, mean))
meanofcensus.subct<-data.frame(apply(as.matrix(census.subct[, 3:27]), 2, mean))
names(meanofcensus.ct)[1]<-"County Means"
names(meanofcensus.subct)[1]<-"Sub-County Means"
meanofcensus.ct
meanofcensus.subct
We can observe that the variances of the data are not all the same; some are larger than the rest by orders of magnitude. If we reduce dimensionality without scaling the variables, most of the principal components we observe would be driven by the variables with the largest variances, so it is a good idea to scale the variables before running principal component analysis.
Centering the data before running PCA also helps us avoid a misleading first principal component, where the direction of maximal variance appears to be one direction but is in fact another. The means at the sub-county and county levels also seem to be far apart, which is a further indication that the resulting principal components could otherwise be misleading. Therefore, for the data at both the county and sub-county levels, it is best practice to both center and scale: centering is good practice when means differ vastly, and scaling controls the standard deviation so that large differences in variance are accounted for. Now we can run the principal component analysis:
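As a quick aside, the effect of scaling can be seen on a small simulated example (synthetic data, not the census):

```r
set.seed(1)
# One variable with a much larger variance than the other.
x <- data.frame(a = rnorm(100, sd = 100), b = rnorm(100, sd = 1))

unscaled <- prcomp(x, center = TRUE, scale = FALSE)
scaled   <- prcomp(x, center = TRUE, scale = TRUE)

# Without scaling, PC1 is essentially the high-variance variable alone;
# with scaling, both variables can contribute to PC1.
abs(unscaled$rotation[, 1])
abs(scaled$rotation[, 1])
```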
# PCA on county-level data, keeping the first two principal components
county_model <- prcomp(subset(census.ct, select = -c(State, County)), center = TRUE, scale = TRUE)
ct.pc <- data.frame(county_model$rotation[, 1:2])
# Sub-county PCA, keeping the first two principal components
subct_model <- prcomp(subset(census.subct, select = -c(State, County, TotalPop, Weight)), center = TRUE, scale = TRUE)
subct.pc <- data.frame(subct_model$rotation[, 1:2])
# Grabbing the three features with the largest absolute loadings on PC1 in both the county and subcounty models.
sort(abs(county_model$rotation[, 1]), decreasing = TRUE)[1:3]
sort(abs(subct_model$rotation[, 1]), decreasing = TRUE)[1:3]
# Checking the loading vectors for the PCA models.
#data.frame(county_model$rotation[, 1])
#data.frame(subct_model$rotation[, 1])
The three features with the largest absolute first-principal-component loadings at the county level are Poverty, ChildPoverty, and IncomePerCap; at the sub-county level they are IncomePerCap, Professional, and Poverty. We observe that at the county level the features with opposite signs are Men, White, Citizen, Poverty, ChildPoverty, Service, Drive, Transit, OtherTransp, PrivateWork, FamilyWork, Unemployment, Minority, and CountyTotal. At the sub-county level the features with opposite signs are Men, Women, White, Citizen, Office, Production, Drive, WorkAtHome, MeanCommute, SelfEmployed, FamilyWork, Unemployment, and Minority. A feature having the opposite sign does not change the variance captured by the first component, because the weights also change sign, and the interpretation therefore remains the same: imagine a biplot rotated 180 degrees, where the relation between the loadings and the data points is exactly the same. The signs of PCA scores and loadings are arbitrary and can be flipped, but only if the signs of both the scores and the loadings are changed at the same time.
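The sign-flip argument can be verified directly on synthetic data: flipping the signs of both the scores and the loadings reconstructs exactly the same data.

```r
set.seed(2)
X  <- scale(matrix(rnorm(50), nrow = 10, ncol = 5))
pc <- prcomp(X, center = FALSE, scale = FALSE)

# Reconstruction from scores and loadings...
X1 <- pc$x %*% t(pc$rotation)
# ...is unchanged when both signs are flipped together.
X2 <- (-pc$x) %*% t(-pc$rotation)

all.equal(X1, X2)   # TRUE
```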
We can determine the minimum number of PCs needed to capture 90% of the variance for both the county and sub-county analyses. Further, we can visualize the proportion of variance explained (PVE) and cumulative PVE for both analyses as well.
ct.PVE <- county_model$sdev^2/sum(county_model$sdev^2)
# Calculating cumulative PVE for county-level data.
Cum_PVE <- cumsum(ct.PVE)
# Calculating PVE for subcounty-level data.
subct.PVE <- subct_model$sdev^2/sum(subct_model$sdev^2)
# Calculating cumulative PVE for subcounty-level data.
Cum_PVE1 <- cumsum(subct.PVE)
options(repr.plot.width = 30, repr.plot.height = 30, repr.plot.res = 100)
par(mfrow=c(2, 2))
plot(ct.PVE, type="l", lwd=3, main="PVE for County",cex.axis=3,cex.lab=2,cex.main=4)
plot(Cum_PVE, type="l", lwd=3,main=" Cumulative PVE for County",cex.axis=3,cex.lab=2, cex.main=4)
plot(subct.PVE, type="l", lwd=3, main="PVE for Subcounty",cex.axis=3,cex.lab=2,cex.main=4)
plot(Cum_PVE1, type="l", lwd=3,main="Cumulative PVE for Subcounty",cex.axis=3,cex.lab=2,cex.main=4)
# Finding the number of principal components needed to explain at least 90% of the variation for county-level data.
which(Cum_PVE == min(Cum_PVE[Cum_PVE >= .9]))
# Finding the number of principal components needed to explain at least 90% of the variation for subcounty-level data.
which(Cum_PVE1 == min(Cum_PVE1[Cum_PVE1 >= .9]))
For the county-level data, it takes 14 PCs to capture 90% of the variance.
For the sub-county-level data, it takes 15 PCs to capture 90% of the variance.
With census.ct we can perform hierarchical clustering with complete linkage. We first run hierarchical clustering on the original features, and then re-run the algorithm using the first 5 principal components of ct.pc as inputs instead. In each case we cut the tree to partition the observations into 10 clusters. We can interpret the effectiveness of the two approaches by assessing which one better places a chosen county; in our case we will use San Mateo as an indicator county.
oldw <- getOption("warn")
options(warn = -1)
census.ct.num <- census.ct %>% ungroup() %>% select(-State, -County)  # keep numeric columns only
ct_dist=dist(census.ct.num, method = "euclidean")
set.seed(4)
cluster=hclust(ct_dist^(1/4), method="complete")
dend.ct=as.dendrogram(cluster)
dend.ct=color_branches(dend.ct,k=10)
dend.ct=color_labels(dend.ct,k=10)
dend.ct=set(dend.ct,"labels_cex",0.85)
dend.ct=set_labels(dend.ct, labels=census.ct$County[order.dendrogram(dend.ct)])
plot(dend.ct,horiz=T,main = "Dendrogram of census.ct Colored with 10 Clusters",cex.main=3,cex.axis=2)
options(warn = oldw)
Now we can visualize the second dendogram:
ct.pc1 =(county_model$x[,1:5])
ct.dist=dist(ct.pc1,method = "euclidean")
cluster1=hclust(ct.dist, method="complete")
dend.ct=as.dendrogram(cluster1)
dend.ct=color_branches(dend.ct,k=10)
dend.ct=color_labels(dend.ct,k=10)
dend.ct=set(dend.ct,"labels_cex",0.85)
dend.ct=set_labels(dend.ct, labels=census.ct$County[order.dendrogram(dend.ct)])
plot(dend.ct,horiz=T,main = "Dendrogram of ct.pc Colored with 10 Clusters",cex.main=3,cex.axis=2)
SM<-cutree(cluster,10)
SMclust<-SM[which(census.ct$County=="San Mateo")]
SM1<-cutree(cluster1,10)
SMclust1<-SM1[which(census.ct$County=="San Mateo")]
plot(county_model$x, col = SM, main="Hierarchical Clustering on County", sub="clusters=10", cex.lab = 1.5,cex.main=3,cex.axis=2)
abline(v = ct.pc[SMclust,1], col = "blue")
plot(ct.pc1 ,col=SM1,
main="Hierarchical Clustering on County with 5 Principal Components", sub="clusters=10", cex.lab=1.5,cex.main=3,cex.axis=2)
abline(v = ct.pc[SMclust1,], col = "blue")
By running hierarchical clustering on census.ct, San Mateo County is placed in the 2nd cluster. On the other hand, running hierarchical clustering on the principal components places San Mateo County in the 7th cluster. Using the principal components in hierarchical clustering provides the better classification of San Mateo County. Principal component analysis provides the most impactful attributes to classify by: reducing dimensionality through the variances reduces the data to a basis consisting of the most variable attributes. Therefore, when running hierarchical clustering on the principal components, we are clustering by the features that provide the most distinction between observations, that is, the characteristics that most markedly define them, so San Mateo is expected to be clustered in accordance with the most characterizing attributes found via PCA. Not using the principal components, on the other hand, leads to clustering without great consideration of the most characterizing attributes; San Mateo can then be clustered with observations that are not necessarily similar to it, and it is not as appropriately clustered as when the principal components are used.
Our data is now ready for classification, so we begin by fitting a decision tree and pruning it to minimize misclassification error, using generated folds for cross-validation. We then visualize the trees before and after pruning, assess the training and test errors, and save these to a records variable.
tmpwinner <- county_winner %>% ungroup %>%
mutate(state = state.name[match(state, state.abb)]) %>% ## state abbreviations
mutate_at(vars(state, county), tolower) %>% ## to all lowercase
mutate(county = gsub(" county| columbia| city| parish", "", county)) ## remove suffixes
tmpcensus <- census.ct %>% mutate_at(vars(State, County), tolower)
election.cl <- tmpwinner %>%
left_join(tmpcensus, by = c("state"="State", "county"="County")) %>%
na.omit
## save meta information
election.meta <- election.cl %>% select(c(county, fips, state, votes, pct, total))
## save predictors and class labels
election.cl = election.cl %>% select(-c(county, fips, state, votes, pct, total))
set.seed(10)
n <- nrow(election.cl)
in.trn <- sample.int(n, 0.8*n)
trn.cl <- election.cl[ in.trn,]
tst.cl <- election.cl[-in.trn,]
set.seed(20)
nfold <- 10
folds <- sample(cut(1:nrow(trn.cl), breaks=nfold, labels=FALSE))
calc_error_rate = function(predicted.value, true.value){
return(mean(true.value!=predicted.value))
}
records = matrix(NA, nrow=4, ncol=2)
colnames(records) = c("train.error","test.error")
rownames(records) = c("tree","logistic","lasso","Knn")
#Fit Tree model
tree=tree(candidate~.,data=trn.cl)
draw.tree(tree, nodeinfo = TRUE,cex=1)
set.seed(102)
# cross validation
cvtree=cv.tree(tree,FUN=prune.misclass, K=10,rand=folds)
#determine best size
best.size.cv=cvtree$size[max(which(cvtree$dev==min(cvtree$dev)))]
#Prune the tree back to the best size chosen by cross-validation
prune=prune.tree(tree,best=best.size.cv, method = "misclass")
draw.tree(prune, nodeinfo = TRUE,cex=1.2)
#Training Error (note: records was already initialized with row/column names above)
set.seed(98)
pred.prune.train = predict(prune,trn.cl,type="class")
records[1,1]<-calc_error_rate(pred.prune.train, trn.cl$candidate)
#Test Error
pred.prune.test=predict(prune,tst.cl,type="class")
records[1,2]<-calc_error_rate(pred.prune.test, tst.cl$candidate)
When comparing the two trees, we recognize how much sparser the pruned tree is and how many more nodes the unpruned tree possesses: the unpruned tree has 27 total nodes, while the pruned tree has nearly half that, with 13. By pruning the tree we reduce the risk of overfitting while maintaining as much information as possible; the total percentage classified correctly only decreased from 93.7% to 93.4%. After pruning, our decision tree is much easier to understand and allows us to tell a clear story that the audience can follow.
In the pruned tree, we observe that Donald Trump dominated in counties with a greater White population and where people don't take public transportation, while Clinton, although struggling overall, leads Trump in counties with lower income, more public transport use, and larger populations.
We run a logistic regression to predict the winning candidate in each county, save the training and test errors into the records object alongside the errors from the previous section, and assess our results.
oldw <- getOption("warn")
options(warn = -1)
trn.cl$candidate <- factor(trn.cl$candidate, levels=c("Donald Trump", "Hillary Clinton"))
tst.cl$candidate <- factor(tst.cl$candidate, levels=c("Donald Trump", "Hillary Clinton"))
glm.fit=glm(candidate~.,data=trn.cl,family=binomial)
print(summary(glm.fit))
#Training Error
prob.training = round(predict(glm.fit, type="response"),digits=5)
trn.cl1 = trn.cl %>%
mutate(predcand=as.factor(ifelse(prob.training<=0.5, "Donald Trump", "Hillary Clinton")))
records[2,1]<-calc_error_rate(trn.cl1$predcand,trn.cl$candidate)
#Test Error
prob.test = round(predict(glm.fit, tst.cl, type="response"),digits=5)
tst.cl1 = tst.cl %>%
mutate(predcand=as.factor(ifelse(prob.test<=0.5, "Donald Trump", "Hillary Clinton")))
records[2,2]<-calc_error_rate(tst.cl1$predcand,tst.cl$candidate)
options(warn = oldw)
The significant variables are White, IncomePerCapErr, WorkAtHome, MeanCommute, and FamilyWork.
The logistic regression is not consistent with the decision tree analysis, as the two methods yield different significant variables. For every one-unit increase in White, the log odds of winning the county (vs. not winning the county) decrease by 1.689e-01, holding the other variables fixed. For every one-unit increase in IncomePerCapErr, the log odds of winning the county (vs. not winning the county) decrease by 3.622e-04, holding the other variables fixed.
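Log-odds coefficients are easier to read after exponentiating them into odds ratios. A minimal base-R sketch using the two estimates quoted above (the `odds_ratio` helper is ours, not part of the fitted model):

```r
# Coefficients quoted in the text for the logistic fit
beta_white <- -1.689e-01       # log-odds change per unit of White
beta_income_err <- -3.622e-04  # log-odds change per unit of IncomePerCapErr

# Multiplicative change in the odds for a delta-unit increase in a predictor
odds_ratio <- function(beta, delta = 1) exp(beta * delta)

# A one-unit increase in White multiplies the odds by about 0.845,
# i.e. roughly a 15.5% drop in the odds of winning the county
round(odds_ratio(beta_white), 3)

# A ten-unit increase compounds the effect
round(odds_ratio(beta_white, delta = 10), 3)
```

The same transformation applies to any coefficient in the summary table, which is why logistic effects are often reported as odds ratios rather than raw log-odds.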
Assessing our logistic regression model fit (glm.fit), we can see that the fitted probabilities are at times numerically 0 or 1. This is an indication of perfect separation (some linear combination of variables perfectly predicts the winner), which is usually a sign of overfitting. One way to control overfitting in logistic regression is through regularization. We can use the cv.glmnet function from the glmnet library to run K-fold cross-validation and select the best regularization parameter for the logistic regression with a LASSO penalty. We set alpha=1 to run LASSO regression and set lambda = c(1, 5, 10, 50) * 1e-4 in cv.glmnet() as pre-defined candidate values for the tuning parameter $\lambda$, since the default candidate values of $\lambda$ in cv.glmnet() are too large for our dataset. We save the training and test errors to the records variable.
x <- model.matrix(candidate~.,trn.cl)[,-1]
y <- ifelse(trn.cl$candidate == "Hillary Clinton",1,0)
cvlasso<-cv.glmnet(x, y, family = "binomial", alpha = 1, lambda = c(1, 5, 10, 50) * 1e-4)
cvlasso
lambda_glmfit = glmnet(x,y,family = "binomial", alpha = 1, lambda=cvlasso$lambda.1se)
coef(lambda_glmfit)
The optimal lambda is 0.005: it gives a much simpler model (only 13 nonzero predictors) while still maintaining a high level of accuracy.
The 13 nonzero coefficients for the penalized fit are White, Citizen, Poverty, Professional, Service, Drive, Carpool, Transit, OtherTransp, Employed, PrivateWork, FamilyWork, and Unemployment. This contrasts with the unpenalized fit which has all 20 nonzero coefficients.
#LASSO FIT
#Training Error
prob.training1 = round(predict(lambda_glmfit,x, type="response"),digits=5)
trn.cl2 = trn.cl %>%
mutate(predcand=as.factor(ifelse(prob.training1<=0.5, "Donald Trump", "Hillary Clinton")))
records[3,1]<-calc_error_rate(trn.cl2$predcand,trn.cl$candidate)
#Test Error
x.test <- model.matrix(candidate~.,tst.cl)[,-1]
prob.test1 = round(predict(lambda_glmfit, x.test, type="response"),digits=5)
tst.cl2 = tst.cl %>%
mutate(predcand=as.factor(ifelse(prob.test1<=0.5, "Donald Trump", "Hillary Clinton")))
records[3,2]<-calc_error_rate(tst.cl2$predcand,tst.cl$candidate)
colnames(records) = c("train.error","test.error")
rownames(records) = c("tree","logistic","lasso","KNN")
records_final<-records[-4,]
records_final
We plot the ROC curves for the decision tree, logistic regression, and LASSO logistic regression using predictions on the test data.
pred.prune.test1=predict(prune,tst.cl,type="vector")[,-1]
#Predict using logistic regression fit
prob.test = round(predict(glm.fit, tst.cl, type="response"),digits=5)
#Predict using LASSO logistic regression fit
prob.test1 = round(predict(lambda_glmfit, x.test, type="response"),digits=5)
#Constructing ROC Curves
pred = prediction(pred.prune.test1, tst.cl[["candidate"]])
perf = performance(pred, measure="tpr", x.measure="fpr")
pred1 = prediction(prob.test,tst.cl$candidate)
perf1 = performance(pred1, measure="tpr", x.measure="fpr")
pred2 = prediction(prob.test1,tst.cl$candidate)
perf2 = performance(pred2, measure="tpr", x.measure="fpr")
#Plotting all curves
plot(perf, col=2, lwd=3, main="ROC curves for Logistic Regression Fit, Lasso Logistic Regression Fit and Tree Fit")
abline(0,1)
par(new=TRUE)
plot(perf1, col=4, lwd=3)
par(new=TRUE)
plot(perf2, col=3, lwd=3)
legend(0.7,0.3,legend=c("Tree Fit","Logistic Regression Fit","Lasso Logistic Regression Fit"),col=c(2,4,3),lty=1,cex=0.6)

Decision trees are non-parametric models that are generally very easy to interpret. Their graphs are intuitive, and trees are not sensitive to monotone transformations of the predictors. However, their predictive accuracy is not competitive with the best supervised learning approaches, and they tend to overfit, with very high variance: a small change in the data can lead to learning a very different tree.
Logistic regression is favored because it is very efficient, requires no tuning, and works better when you remove attributes that are unrelated to the output or correlated with each other. It represents a good baseline against which to analyze the performance of more complex models. However, logistic regression cannot solve non-linear problems since it has a linear decision surface; it also gets outperformed by more complex algorithms and is only a useful tool if all the important independent variables have already been identified.
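The linear-decision-surface limitation can be seen on a toy XOR pattern, which no single line can separate. A minimal base-R sketch (simulated data, not our county dataset):

```r
set.seed(1)
# Simulate an XOR-style pattern: the class is 1 exactly when the
# signs of x1 and x2 disagree -- no single linear boundary separates it.
n <- 400
x1 <- runif(n, -1, 1)
x2 <- runif(n, -1, 1)
y <- as.numeric(xor(x1 > 0, x2 > 0))
dat <- data.frame(x1, x2, y)

# Plain logistic regression: linear decision surface, near-chance accuracy
lin.fit <- glm(y ~ x1 + x2, data = dat, family = binomial)
acc.lin <- mean((predict(lin.fit, type = "response") > 0.5) == y)

# Adding the interaction term bends the boundary and recovers the pattern
# (expect a perfect-separation warning here, as discussed above)
int.fit <- glm(y ~ x1 * x2, data = dat, family = binomial)
acc.int <- mean((predict(int.fit, type = "response") > 0.5) == y)

c(linear = acc.lin, interaction = acc.int)
```

The fix here is hand-engineering the interaction feature; tree-based methods find such splits automatically, which is their advantage on non-linear boundaries.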
LASSO logistic regression is an automatic method that selects features by shrinking coefficients toward zero, performing variable selection and regularization simultaneously to improve the prediction accuracy and interpretability of the resulting model. It often outperforms other automatic variable-selection methods, but it can produce models that don't make sense, it can ignore important variables, and it does not follow the hierarchy principle.
The different classifiers cater to specific questions, where the data cooperates with the classifier to produce the best result. Each classifier possesses its own strengths and purpose: decision trees allow audiences to easily follow a "story," but are not as powerful as other models when it comes to prediction. When the question at hand requires a particular dataset, the data itself does not change, but trying different classifiers helps achieve the most accurate results.
As we found when conducting research for this analysis, elections are extremely difficult to predict accurately: small changes and errors can result in a completely incorrect prediction. Elections involve a significant number of factors that enable prediction, but sentiment can quickly shift from one candidate to another, requiring models to weight these factors appropriately. This research project allowed me and my partner to analyze which factors in particular hold the most weight when trying to produce the best predictions possible.
In the visualizations of the poverty percentage for each state, we hoped to analyze whether there were any trends with respect to the Black and White populations of each state. By examining the White population and poverty levels, we can connect them with our decision tree analysis to get an idea of which states, and which of their counties, Trump or Clinton will win. In states with a higher White population, we can expect Trump to win a lot of counties, given the White population percentages observed and the splits in the decision tree.
Out of the three methods explored, we determined that logistic regression worked the best, as it possessed the highest true positive rate and the lowest training error compared to LASSO logistic regression and decision trees. From our ROC curves we observed that the tree fit has the lowest AUC value, as it sits below the logistic regression fit and the LASSO logistic regression fit. The logistic regression and LASSO logistic regression curves are very similar, but due to the lower training error of the logistic regression, we favor it over the LASSO logistic regression. Also, since we removed unnecessary and unrelated attributes, logistic regression can work efficiently relative to the other methods.
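The AUC values we read off the ROC plot can also be computed directly from predicted scores via the rank (Wilcoxon/Mann-Whitney) formulation: AUC is the probability that a randomly chosen positive scores higher than a randomly chosen negative. A minimal base-R sketch with toy scores (not our fitted probabilities):

```r
# AUC via the Mann-Whitney rank statistic
auc <- function(scores, labels) {
  r <- rank(scores)                     # average ranks handle ties
  n.pos <- sum(labels == 1)
  n.neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Toy example: higher scores mostly associated with label 1
auc(c(0.1, 0.4, 0.35, 0.8), c(0, 0, 1, 1))
# -> 0.75: three of the four positive/negative pairs are ranked correctly
```

Applied to each model's test-set probabilities, this would give the numeric AUCs behind the visual ordering of the curves.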
We could improve our analysis by incorporating data from 2008 and 2012 in order to see what discrepancies there may be among the three elections. The extra data would also help test whether the factors from the 2016 election held in the 2008 and 2012 elections as well; if they differed, we would learn that the 2016 election may have been a unique circumstance and a poor basis for predicting a "normal" election. Data from other Senate elections could also improve our analysis, as we could examine voting behavior based on party affiliation and whether the same factors carry similar weight as in the presidential election.
library(class)
library(reshape2)
#exploring additional classification methods:
#knn
Ytrain <- trn.cl$candidate
Xtrain <- trn.cl %>% select(-candidate)
Ytest <- tst.cl$candidate
Xtest <- tst.cl %>% select(-candidate)
# do.chunk() for k-fold Cross-validation
do.chunk <- function(chunkid, folddef, Xdat, Ydat, ...){ # Function arguments
  train = (folddef != chunkid)  # Get training index
  Xtr = Xdat[train,]            # Training set for this fold
  Ytr = Ydat[train]             # True labels in training set
  Xvl = Xdat[!train,]           # Validation set for this fold
  Yvl = Ydat[!train]            # True labels in validation set
  predYtr = knn(train=Xtr, test=Xtr, cl=Ytr, ...)  # Predict training labels
  predYvl = knn(train=Xtr, test=Xvl, cl=Ytr, ...)  # Predict validation labels
  data.frame(fold = chunkid,
             train.error = mean(predYtr != Ytr),   # Training error for this fold
             val.error = mean(predYvl != Yvl))     # Validation error for this fold
}
# creating a vector of candidate k values
kvec <- c(1, seq(10, 50, length.out=9))
kerrors <- NULL
# cross-validation over each candidate k
for (j in kvec) {
  tve <- plyr::ldply(1:nfold, do.chunk, folddef=folds,
                     Xdat=Xtrain, Ydat=Ytrain, k=j)
  tve$neighbors <- j
  kerrors <- rbind(kerrors, tve)
}
# calculating mean validation error at each k
errors <- melt(kerrors, id.vars=c("fold","neighbors"), value.name="error")
val.error.means <- errors %>%
  filter(variable=="val.error") %>%
  group_by(neighbors) %>%
  summarise(error = mean(error))
# picking the best k
min.error <- val.error.means %>%
filter(error==min(error))
bestk <- max(min.error$neighbors)
bestk
# calculating mean training error at each k
# (by averaging the cv results)
train.error.means <- errors %>%
  filter(variable=="train.error") %>%
  group_by(neighbors) %>%
  summarise(error = mean(error))
pred.Ytrain <- knn(train = Xtrain, test = Xtrain, cl = Ytrain, k = bestk)
records[4,1] <- calc_error_rate(pred.Ytrain, Ytrain)
pred.Ytest <- knn(train = Xtrain, test = Xtest, cl = Ytrain, k = bestk)
records[4,2] <- calc_error_rate(pred.Ytest, Ytest)
colnames(records) = c("train.error","test.error")
rownames(records) = c("tree","logistic","lasso", "KNN")
records
We observe that the best k in this instance is k=25. We output the records object, now including the KNN results. Comparing KNN with the decision tree, logistic regression, and LASSO logistic regression, KNN possesses both the largest test error and the largest training error of the four. Logistic regression possesses the lowest training error, but the decision tree owns the lowest test error of all the methods. KNN is nonparametric and makes no assumptions about the shape of the decision boundary, so it tends to beat logistic regression only when the decision boundary is highly non-linear, which does not appear to be the case here; logistic regression dominates it in this analysis. In addition, KNN does not take into account which predictors are important. Logistic regression and the decision tree have similar training errors, with logistic regression's being slightly smaller, but the decision tree's test error is smaller by 0.005, making it more favorable, especially since decision trees are easy for the audience to interpret.
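The point that KNN weighs all predictors equally can be demonstrated with a tiny hand-rolled 1-nearest-neighbor classifier: one irrelevant large-scale predictor swamps the informative one. A base-R sketch on simulated data (not our county dataset; `nn1` is our own toy implementation, not class::knn):

```r
set.seed(42)
# Hand-rolled 1-nearest-neighbor classifier (squared Euclidean distance)
nn1 <- function(Xtr, Ytr, Xte) {
  apply(Xte, 1, function(q) {
    d <- rowSums(sweep(Xtr, 2, q)^2)  # distance from query q to each training row
    Ytr[which.min(d)]                 # label of the closest training point
  })
}

n <- 200
x.signal <- c(rnorm(n, -2), rnorm(n, 2))   # informative predictor
y <- rep(c(0, 1), each = n)
x.noise <- rnorm(2 * n, sd = 100)          # irrelevant predictor on a huge scale

idx <- sample(2 * n, n)                    # random train/test split
Xtr1 <- cbind(x.signal)[idx, , drop = FALSE]
Xte1 <- cbind(x.signal)[-idx, , drop = FALSE]
Xtr2 <- cbind(x.signal, x.noise)[idx, ]
Xte2 <- cbind(x.signal, x.noise)[-idx, ]

acc.signal <- mean(nn1(Xtr1, y[idx], Xte1) == y[-idx])  # high accuracy
acc.noisy  <- mean(nn1(Xtr2, y[idx], Xte2) == y[-idx])  # near chance
c(signal.only = acc.signal, with.noise = acc.noisy)
```

This is why KNN is usually run on standardized predictors, and even then it cannot downweight variables that are simply uninformative.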